Task 0

Author

Amelia Shuler - id document: Z0710125M

Instructions (read before starting)

  • Modify your personal data (name and ID) in the header of the .qmd document.

  • Do not touch anything else in the header (note that I have included embed-resources: true so that everything is contained in a single html without extra files, and theme: [style.scss] to give the delivery a nicer style using the style.scss file in the folder).

  • Make sure, BEFORE further editing the document, that the .qmd file is rendered correctly and the corresponding .html is generated in your local folder on your computer.

  • The chunks (code boxes) provided are either empty or incomplete, hence most of them have the #| eval: false option. Once you have edited a chunk, you must change it to #| eval: true (or remove the option entirely) so that it runs.

  • Remember that you can run chunk by chunk with the play button or run all chunks up to a given chunk (with the button to the left of the previous one)

  • Only the generated .html will be evaluated.

  • Be careful with spaces and line breaks!

Required packages

We will need the following packages (press play on the chunk to load them):

rm(list = ls()) # Remove old variables
library(glue)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tibble)
library(lubridate)

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union

Case study: analysis of Taylor Swift’s songs

Exercise 1

Install the {taylor} package (in your console, not in your chunk!) and then load it.

# install.packages("taylor")
library(taylor)

In that package we have the dataset taylor_album_songs with the characteristics of Taylor Swift’s album songs (songs outside of albums are excluded and only songs owned by Taylor Swift are considered).

# just this, it is a tibble
taylor_album_songs 
# A tibble: 240 × 29
   album_name   ep    album_release track_number track_name     artist featuring
   <chr>        <lgl> <date>               <int> <chr>          <chr>  <chr>    
 1 Taylor Swift FALSE 2006-10-24               1 Tim McGraw     Taylo… <NA>     
 2 Taylor Swift FALSE 2006-10-24               2 Picture To Bu… Taylo… <NA>     
 3 Taylor Swift FALSE 2006-10-24               3 Teardrops On … Taylo… <NA>     
 4 Taylor Swift FALSE 2006-10-24               4 A Place In Th… Taylo… <NA>     
 5 Taylor Swift FALSE 2006-10-24               5 Cold As You    Taylo… <NA>     
 6 Taylor Swift FALSE 2006-10-24               6 The Outside    Taylo… <NA>     
 7 Taylor Swift FALSE 2006-10-24               7 Tied Together… Taylo… <NA>     
 8 Taylor Swift FALSE 2006-10-24               8 Stay Beautiful Taylo… <NA>     
 9 Taylor Swift FALSE 2006-10-24               9 Should've Sai… Taylo… <NA>     
10 Taylor Swift FALSE 2006-10-24              10 Mary's Song (… Taylo… <NA>     
# ℹ 230 more rows
# ℹ 22 more variables: bonus_track <lgl>, promotional_release <date>,
#   single_release <date>, track_release <date>, danceability <dbl>,
#   energy <dbl>, key <int>, loudness <dbl>, mode <int>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <int>, duration_ms <int>, explicit <lgl>,
#   key_name <chr>, mode_name <chr>, key_mode <chr>, lyrics <list>

Exercise 1

How many songs are stored? How many features are stored for each song?

nrow(taylor_album_songs)
[1] 240
ncol(taylor_album_songs)
[1] 29

There are 240 songs (rows), and 29 features (columns) are stored for each song.

Exercise 2

Get the name of the (unique) albums contained in the dataset (variable album_name). How many are there?

album_names <- unique(taylor_album_songs$album_name)
length(album_names)
[1] 11

There are 11 unique albums.

Exercise 3

In how many songs is there a collaboration with another artist (if there is a collaboration, its name is stored in featuring)?

nrow(taylor_album_songs[!is.na(taylor_album_songs$featuring), ])
[1] 23
sum(!is.na(taylor_album_songs$featuring))
[1] 23

There are 23 songs that have a collaboration with another artist.
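
The counting trick used above can be sketched on a made-up vector. The artist names below are placeholders, not values from the package; the point is only that is.na() yields a logical vector and that sum() treats each TRUE as 1.

```r
# Hypothetical toy column: NA means "no collaboration" (placeholder names)
featuring <- c(NA, "Artist A", NA, "Artist B")

# Logical vector: TRUE where a collaborator is recorded
has_feature <- !is.na(featuring)

# sum() counts the TRUE values, mean() gives their proportion
n_collabs <- sum(has_feature)
share_collabs <- mean(has_feature)

print(n_collabs)
print(share_collabs)
```

The same idea is why both lines of the answer return the same number: filtering the rows and counting them is equivalent to summing the logical condition.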

Exercise 4

How many different artists has she collaborated with (in any song)?

length(unique(taylor_album_songs$featuring[!is.na(taylor_album_songs$featuring)]))
[1] 20

Taylor Swift has collaborated with 20 different artists. The NA values are removed with !is.na() before calling unique(), because unique() has no na.rm argument; without that filter, NA would be counted as one more value (21).
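
As a minimal sketch of how missing values affect a count of distinct names, consider the made-up vector below (the names are placeholders, not data from {taylor}). It assumes only base R and shows two equivalent ways of dropping NA before calling unique().

```r
# Hypothetical toy vector standing in for the featuring column
featuring <- c(NA, "Artist A", NA, "Artist B", "Artist A")

# unique() keeps NA as one of the distinct values
n_with_na <- length(unique(featuring))

# Filtering with !is.na() first counts only real names
n_filtered <- length(unique(featuring[!is.na(featuring)]))

# na.omit() is another base R way to drop the missing values
n_omitted <- length(unique(na.omit(featuring)))

print(c(n_with_na, n_filtered, n_omitted))
```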

Exercise 5

Why is the code below wrong? Try to understand the idea behind it, explain what the goal of the code seems to be, and fix it to obtain the proper tibble. Reminder: taylor_album_songs is a tibble, not a vector.

# taylor_album_songs[sort(duration_ms, decreasing = TRUE)]

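The goal seems to be to sort the songs from longest to shortest. The code fails for three reasons: duration_ms is a column of the tibble, not an object in the environment, so it must be referenced as taylor_album_songs$duration_ms; sort() returns the sorted values, whereas order() returns the row positions needed for indexing; and without a comma inside the brackets the index selects columns instead of rows. With na.last = NA the 3 songs with a missing duration are dropped, so 237 rows remain. The code should read: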

taylor_album_songs[order(taylor_album_songs$duration_ms, decreasing = TRUE, na.last = NA), ]
# A tibble: 237 × 29
   album_name       ep    album_release track_number track_name artist featuring
   <chr>            <lgl> <date>               <int> <chr>      <chr>  <chr>    
 1 Red (Taylor's V… FALSE 2021-11-12              30 All Too W… Taylo… <NA>     
 2 Speak Now (Tayl… FALSE 2023-07-07               5 Dear John… Taylo… <NA>     
 3 Speak Now (Tayl… FALSE 2023-07-07              13 Last Kiss… Taylo… <NA>     
 4 Speak Now (Tayl… FALSE 2023-07-07               9 Enchanted… Taylo… <NA>     
 5 THE TORTURED PO… FALSE 2024-04-19               6 But Daddy… Taylo… <NA>     
 6 THE TORTURED PO… FALSE 2024-04-19              10 Who's Afr… Taylo… <NA>     
 7 Red (Taylor's V… FALSE 2021-11-12               5 All Too W… Taylo… <NA>     
 8 Red (Taylor's V… FALSE 2021-11-12              20 State Of … Taylo… <NA>     
 9 Speak Now (Tayl… FALSE 2023-07-07              22 Timeless … Taylo… <NA>     
10 Speak Now (Tayl… FALSE 2023-07-07              14 Long Live… Taylo… <NA>     
# ℹ 227 more rows
# ℹ 22 more variables: bonus_track <lgl>, promotional_release <date>,
#   single_release <date>, track_release <date>, danceability <dbl>,
#   energy <dbl>, key <int>, loudness <dbl>, mode <int>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <int>, duration_ms <int>, explicit <lgl>,
#   key_name <chr>, mode_name <chr>, key_mode <chr>, lyrics <list>
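
The difference between sort() and order() can be seen on a small made-up data frame (the track names and durations are hypothetical, not taken from {taylor}). This is only a sketch in base R; it uses a plain data.frame instead of a tibble so that it runs without extra packages.

```r
# Hypothetical toy data (not from the package)
songs <- data.frame(
  track_name  = c("a", "b", "c", "d"),
  duration_ms = c(200000, NA, 350000, 180000)
)

# sort() returns the sorted values themselves (NA is dropped by default)
sorted_values <- sort(songs$duration_ms, decreasing = TRUE)

# order() returns the row positions that would sort the column;
# na.last = NA removes the positions of missing values
idx <- order(songs$duration_ms, decreasing = TRUE, na.last = NA)

# Positions before the comma reorder the rows and keep all columns
longest_first <- songs[idx, ]

print(sorted_values)
print(idx)
print(longest_first)
```

Indexing with the sorted values, as in the original wrong code, would ask for rows or columns numbered 350000, 200000 and so on, which do not exist.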

Exercise 6

Create a new tibble with only the variables album_name, album_release, track_name, featuring and duration_ms. Then sort the rows by date from newest to oldest. The output should also be a tibble.

new_tb <- tibble(
  album_name = taylor_album_songs$album_name, album_release = taylor_album_songs$album_release,
  track_name = taylor_album_songs$track_name, featuring = taylor_album_songs$featuring,
  duration_in_ms = taylor_album_songs$duration_ms
)
sorted_tb <- new_tb[order(new_tb$album_release, decreasing = TRUE), ]
sorted_tb
# A tibble: 240 × 5
   album_name                  album_release track_name featuring duration_in_ms
   <chr>                       <date>        <chr>      <chr>              <int>
 1 THE TORTURED POETS DEPARTM… 2024-04-19    Fortnight  Post Mal…         228965
 2 THE TORTURED POETS DEPARTM… 2024-04-19    The Tortu… <NA>              293048
 3 THE TORTURED POETS DEPARTM… 2024-04-19    My Boy On… <NA>              203801
 4 THE TORTURED POETS DEPARTM… 2024-04-19    Down Bad   <NA>              261228
 5 THE TORTURED POETS DEPARTM… 2024-04-19    So Long, … <NA>              262975
 6 THE TORTURED POETS DEPARTM… 2024-04-19    But Daddy… <NA>              340428
 7 THE TORTURED POETS DEPARTM… 2024-04-19    Fresh Out… <NA>              210789
 8 THE TORTURED POETS DEPARTM… 2024-04-19    Florida!!! Florence…         215463
 9 THE TORTURED POETS DEPARTM… 2024-04-19    Guilty As… <NA>              254366
10 THE TORTURED POETS DEPARTM… 2024-04-19    Who's Afr… <NA>              334085
# ℹ 230 more rows
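
A shorter route to the same kind of result is to select the columns by name rather than rebuilding the table column by column. The sketch below uses a made-up data frame (album names, dates and the energy values are hypothetical) and base R only; with dplyr loaded, select() followed by arrange(desc(album_release)) would be another option.

```r
# Hypothetical toy data (not from the package)
songs <- data.frame(
  album_name    = c("Album X", "Album Y", "Album Z"),
  album_release = as.Date(c("2010-01-15", "2020-06-01", "2015-03-20")),
  track_name    = c("t1", "t2", "t3"),
  energy        = c(0.5, 0.7, 0.6)
)

# Keep only the wanted columns, selected by name
keep <- c("album_name", "album_release", "track_name")
subset_df <- songs[, keep]

# Sort rows from newest to oldest release date
newest_first <- subset_df[order(subset_df$album_release, decreasing = TRUE), ]

print(newest_first)
```

Note that selecting by name keeps the original column names, whereas the answer above renames duration_ms to duration_in_ms.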

Exercise 7

Add to the previous dataset two new variables with the month and year of release (use the album_release variable). Think about how you could determine in which month she has released the most albums.

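# Use the dates of sorted_tb itself so each row gets its own month and year
# (taylor_album_songs is in a different row order than sorted_tb)
sorted_tb$album_release_month <- month(sorted_tb$album_release)
sorted_tb$album_release_year <- year(sorted_tb$album_release)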
sorted_tb
# A tibble: 240 × 7
   album_name                  album_release track_name featuring duration_in_ms
   <chr>                       <date>        <chr>      <chr>              <int>
 1 THE TORTURED POETS DEPARTM… 2024-04-19    Fortnight  Post Mal…         228965
 2 THE TORTURED POETS DEPARTM… 2024-04-19    The Tortu… <NA>              293048
 3 THE TORTURED POETS DEPARTM… 2024-04-19    My Boy On… <NA>              203801
 4 THE TORTURED POETS DEPARTM… 2024-04-19    Down Bad   <NA>              261228
 5 THE TORTURED POETS DEPARTM… 2024-04-19    So Long, … <NA>              262975
 6 THE TORTURED POETS DEPARTM… 2024-04-19    But Daddy… <NA>              340428
 7 THE TORTURED POETS DEPARTM… 2024-04-19    Fresh Out… <NA>              210789
 8 THE TORTURED POETS DEPARTM… 2024-04-19    Florida!!! Florence…         215463
 9 THE TORTURED POETS DEPARTM… 2024-04-19    Guilty As… <NA>              254366
10 THE TORTURED POETS DEPARTM… 2024-04-19    Who's Afr… <NA>              334085
# ℹ 230 more rows
# ℹ 2 more variables: album_release_month <dbl>, album_release_year <dbl>
sum(sorted_tb$album_release_month == 1)
[1] 0
sum(sorted_tb$album_release_month == 2)
[1] 0
table(month(unique(taylor_album_songs$album_release)))

 4  7  8 10 11 12 
 2  2  1  3  2  1 

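The two new variables, album_release_month and album_release_year, appear at the end of the tibble. Tabulating the months of the unique release dates shows that October has the most album releases (3).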
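
Instead of reading the busiest month off the table by eye, it can be extracted programmatically. The following sketch uses made-up release dates (not the real album dates) and base R's format() in place of lubridate's month(), so that it runs without extra packages.

```r
# Hypothetical release dates, one per album (not real data)
release_dates <- as.Date(c("2001-10-05", "2003-10-20", "2005-04-11", "2008-07-30"))

# Extract the month number from each date
release_month <- as.integer(format(release_dates, "%m"))

# Count albums per month
counts <- table(release_month)

# Month(s) with the highest count; ties would return several months
top_month <- names(counts)[as.vector(counts) == max(counts)]

print(counts)
print(top_month)
```

Working on one date per album matters here: counting rows of the song-level table would weight each month by the number of songs, not albums.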

Exercise 8

Get the average duration of the songs in minutes (variable duration_ms in milliseconds).

mean(taylor_album_songs$duration_ms / 60000, na.rm = TRUE)
[1] 3.959622

The average song duration is 3.96 minutes.
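
The conversion relies on 1 minute = 60 × 1000 = 60000 milliseconds. A minimal sketch with made-up durations (hypothetical values, chosen so the arithmetic is easy to check) also shows why na.rm = TRUE is needed when some durations are missing.

```r
# Hypothetical durations in milliseconds, with one missing value
duration_ms <- c(180000, 240000, NA, 300000)

# Convert to minutes: divide by 60000 ms per minute
duration_min <- duration_ms / 60000

# Without na.rm the mean is NA because of the missing value
mean_with_na <- mean(duration_min)

# With na.rm = TRUE the missing value is ignored
mean_minutes <- mean(duration_min, na.rm = TRUE)

print(mean_with_na)
print(mean_minutes)
```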